Abstract — We propose a novel hierarchical model for multitask bipartite ranking. The proposed approach combines a matrix-variate 
Gaussian process with a generative model for task-wise bipartite ranking. In addition, we employ a novel trace constrained variational 
inference approach to impose low rank structure on the posterior matrix-variate Gaussian process. The resulting posterior covariance 
function is derived in closed form, and the posterior mean function is the solution to a matrix-variate regression with a novel spectral 
elastic net regularizer. Further, we show that variational inference for the trace constrained matrix-variate Gaussian process combined 
with maximum likelihood parameter estimation for the bipartite ranking model is jointly convex. 

Our motivating application is the prioritization of candidate disease genes. The goal of this task is to aid the identification of unobserved 
associations between human genes and diseases using a small set of observed associations as well as kernels induced by gene- 
gene interaction networks and disease ontologies. Our experimental results illustrate the performance of the proposed model on real 
world datasets. Moreover, we find that the resulting low rank solution improves the computational scalability of training and testing as 
compared to baseline models. 

Index Terms — Gaussian process. Multitask learning. Bipartite ranking. Trace norm. 




1 Introduction 

RANKING is the task of learning an ordering for a 
set of items. In bipartite ranking, these items are 
drawn from two sets, known as the positive set and 
the negative set. Bipartite ranking involves learning an 
ordering that ranks the positive items ahead of the 
negative items [1], [2], [3], [4]. This paper proposes 
a generative model for bipartite ranking and an ex- 
tension of bipartite ranking to the multitask domain. 
Our approach combines a latent multitask regression 
function with task-wise ordered observation variables. 
We employ a non-parametric matrix-variate Gaussian 
process prior for the multitask regression. Further, we 
propose a novel trace constrained variational inference 
approach that imposes useful low rank structure on the 
multitask regression. 

Multitask learning (MTL) exploits inter-task relationships to improve the prediction quality over single task learning [5], [6]. An important class of methods in this domain is based on the matrix-variate Gaussian process (MV-GP) and closely related models for vector valued reproducing kernel Hilbert space (RKHS) function estimation [7]. The MV-GP is an extension of the matrix-variate Gaussian distribution [8] to (possibly) infinite dimensional feature spaces. Alternatively, the MV-GP may be understood as an extension of the scalar valued Gaussian process [9] to vector valued responses. The MV-GP is a useful model for learning multiple correlated tasks, as it jointly models the correlations across examples and across tasks. The MV-GP has been applied to link analysis, transfer learning [10], collaborative prediction [11] and multitask learning [12] among other applications. 

Our motivating application is the prioritization of 
disease genes. Genes are segments of DNA that de- 
termine specific characteristics; over 20,000 genes have 
been identified in humans, which interact to regulate 
various functions in the body. Researchers have iden- 
tified thousands of diseases, including various cancers 
and respiratory diseases such as asthma [13], caused by 
mutations in these genes. The standard approach for 
discovering disease-causing genes is the genetic association 
study [14]. However, these studies are often tedious 
and expensive to conduct. Hence, computational meth- 
ods that can reduce the search space by predicting a 
prioritized list of candidate genes for a given disease 
are of significant scientific interest. 

The disease-gene prioritization task has received a significant amount of study in recent years [15], [16], [17], [18]. The task is challenging because all the observed responses correspond to known associations and the states of the unobserved associations are unknown, i.e., there are no reliable negative examples. Such problems are also known as single class or positive-unlabeled (PU) learning tasks [19]. A common approach for this task is to learn a model that maximizes the classification accuracy between the positive class and the unlabeled class [20]. In the collaborative filtering literature, such single class tasks have also been addressed using the low rank matrix factorization approach [21]. 

Recent work suggests that a model trained to rank the positive class ahead of the unknowns can be effective for ranking the unknown positive items ahead of unknown negative items [19]. Further, the scientific use case for gene prioritization depends on (and is evaluated by) the accuracy of the ranked list produced [15], [17]. For these reasons, disease-gene prioritization is well posed as a bipartite ranking task. A low rank model induces significant correlation between the predictions of different tasks. This assumption matches observations made by domain experts that the gene ranking profiles of diseases exhibit strong correlations [18], [22]. This assumption is further validated by the empirical performance of the low rank model. 

Low rank structure is a typical constraint in sev- 
eral real world multitask and matrix learning problems. 
However, despite its use for multitask learning, the stan- 
dard MV-GP does not model low rank structure. Further, 
memory requirements for the MV-GP scale quadratically 
with data size, and naive computation scales cubically 
with data size. These computational properties limit 
the applicability of the MV-GP to large scale problems. 
The hierarchical factor Gaussian process (factor GP) has 
been proposed as an alternative for problems with low 
rank structure [23], [24]. Here, latent row and column 
factors are drawn from a Gaussian process prior. The 
result is a model with mean of user-selected rank. We 
argue that the factor GP is an unsatisfactory model for 
two reasons: (i) the resulting posterior mean function 
is the solution of a non-convex optimization problem, 
and (ii) the posterior covariance is often intractable. We 
will show that the proposed trace constrained MV-GP 
provides the same low rank structure benefits without 
the drawbacks of the factor GP model. The proposed 
variational inference is jointly convex in the mean and 
the covariance, and the posterior covariance is given in 
closed form. 

As a computational model, the optimization problem 
for the mean function of the trace constrained MV-GP is 
equivalent to kernel multitask learning with the sum of 
squared errors cost function [25] combined with a novel 
regularizer. We will show that this regularizer can be 
expressed as a weighted sum of the Hilbert and the trace 
norms. We call the resulting regularization the spectral 
elastic net, highlighting its relationship to elastic net 
regularization for variable selection in finite dimensional 
linear models [26]. To the best of our knowledge, ours is 
the first application of the spectral elastic net regularizer 
to matrix estimation and kernel multitask learning. 

This paper proposes a novel generative model for mul- 
titask bipartite ranking and a novel constrained varia- 
tional inference approach for the matrix variate Gaussian 
process applied to the disease-gene prioritization task. 
The main contributions of this paper are as follows: 

• We propose a novel variational inference approach for matrix-variate Gaussian process regression using a trace norm constraint (section 3). This constraint typically results in a regression matrix of low rank. 

• We propose a novel generative model for bipartite ranking (section 4). To our knowledge, ours is the first such generative model proposed in the literature. 

• We show that variational inference for the latent regression model combined with maximum likelihood parameter estimation for the bipartite ranking is jointly convex (section 4.3). 

• We evaluate the proposed model empirically and show that it outperforms the state of the art domain specific model for the disease-gene prioritization task (section 5). 

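The Kronecker product and the vec operator: We will make significant use of the Kronecker product and the $\mathrm{vec}(\cdot)$ operator. Given a matrix $A \in \mathbb{R}^{P \times Q}$, $\mathrm{vec}(A) \in \mathbb{R}^{PQ}$ is the vector obtained by concatenating columns of $A$. Given matrices $A \in \mathbb{R}^{P \times Q}$ and $B \in \mathbb{R}^{P' \times Q'}$, the Kronecker product of $A$ and $B$ is denoted as $A \otimes B \in \mathbb{R}^{PP' \times QQ'}$. A useful property of the Kronecker product is the identity: $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$, where $X \in \mathbb{R}^{Q \times P'}$. 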
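
The vec identity is easy to check numerically. The following is a minimal NumPy sketch, not part of the original paper; the matrix sizes and the helper name `vec` are illustrative assumptions, and `vec` uses column-major stacking to match the definition above.

```python
import numpy as np


def vec(A):
    """Stack the columns of A into a single vector (column-major order)."""
    return A.reshape(-1, order="F")


rng = np.random.default_rng(0)
P, Q, P2, Q2 = 3, 4, 5, 2          # illustrative sizes; P2, Q2 play the roles of P', Q'
A = rng.standard_normal((P, Q))
B = rng.standard_normal((P2, Q2))
X = rng.standard_normal((Q, P2))   # X must be Q x P' for the product AXB to be defined

kron_AB = np.kron(A, B)            # shape (P * P', Q * Q')
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
identity_holds = bool(np.allclose(lhs, rhs))
print(kron_AB.shape, identity_holds)
```

The same column-major convention is what makes $\mathbf{K}_{\mathsf{N}} \otimes \mathbf{K}_{\mathsf{M}}$ (column factor first) the natural ordering for the covariance of a vectorized matrix.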

2 The matrix-variate Gaussian process 

The matrix-variate Gaussian process (MV-GP) is a collection of random variables defined by their joint distribution for finite index sets. Let $\mathsf{M} \ni m$ be the set representing the rows (tasks) and $\mathsf{N} \ni n$ be the set representing the columns (examples), with sizes $|\mathsf{M}| = M$ and $|\mathsf{N}| = N$. Let $X \sim \mathcal{GP}(\phi, \mathcal{K})$, where $\mathcal{GP}(\phi, \mathcal{K})$ denotes the MV-GP with mean function $\phi$ and covariance function $\mathcal{K}$. As with the scalar GP, the MV-GP is completely specified by its mean function and its covariance function. These are defined as: 

$$\phi(m, n) = \mathbb{E}[X(m, n)]$$
$$\mathcal{K}((m, n), (m', n')) = \mathbb{E}\big[(X(m, n) - \phi(m, n))\,(X(m', n') - \phi(m', n'))\big],$$

where $\mathbb{E}[\cdot]$ is the expected value. For a finite index set $\mathsf{M} \times \mathsf{N}$, define the matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$ such that $x_{m,n} = X(m, n)$; then $\mathrm{vec}(\mathbf{X})$ is distributed as a multivariate Gaussian with mean $\mathrm{vec}(\mathbf{\Phi})$ and covariance matrix $\mathbf{K}$, i.e., $\mathrm{vec}(\mathbf{X}) \sim \mathcal{N}(\mathrm{vec}(\mathbf{\Phi}), \mathbf{K})$, where $\phi_{m,n} = \phi(m, n)$, $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$, $k_{(m,n),(m',n')} = \mathcal{K}((m, n), (m', n'))$, and $\mathbf{K} \in \mathbb{R}^{MN \times MN}$. 

The covariance function of the prior MV-GP is assumed to have Kronecker product structure [7], [11]. The Kronecker product prior covariance captures the assumption that the prior covariance between matrix entries can be decomposed as the product of the row and column covariances. The Kronecker prior covariance assumption is a useful restriction as: (i) it improves computational tractability, enabling the model to scale to larger problems than may be possible with a full joint prior covariance, (ii) the regularity imposed by the separability assumption improves the reliability of parameter estimates even with significant data sparsity, e.g., when the observed data consists of a single matrix (sub-)sample, and (iii) row-wise and column-wise prior covariance functions are often the only prior information available. A closely related concept is kernel MTL with separable kernels. This is a special case of vector valued regularized function estimation where the joint kernel decomposes into the product of the row kernel and the column kernel [7], fSS]. Learning in these models is analogous to inference for the MV-GP, with the prior row (resp. column) covariance matrix used as row (resp. column) kernels. 

Fig. 1. Plate model of the matrix-variate Gaussian process. $Z(m, n)$ is the hidden noise-free matrix entry. 

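Define the row covariance (kernel) function $\mathcal{K}_{\mathsf{M}} : \mathsf{M} \times \mathsf{M} \to \mathbb{R}$ and the column covariance function $\mathcal{K}_{\mathsf{N}} : \mathsf{N} \times \mathsf{N} \to \mathbb{R}$. The joint covariance function of the MV-GP with Kronecker covariance decomposes into product form as $\mathcal{K}((m, n), (m', n')) = \mathcal{K}_{\mathsf{M}}(m, m')\,\mathcal{K}_{\mathsf{N}}(n, n')$, or equivalently, $\mathcal{K} = \mathcal{K}_{\mathsf{N}} \otimes \mathcal{K}_{\mathsf{M}}$. Hence, for the random variable $X \sim \mathcal{GP}(\phi, \mathcal{K}_{\mathsf{N}} \otimes \mathcal{K}_{\mathsf{M}})$ and a finite index set $\mathsf{M} \times \mathsf{N}$, $\mathrm{vec}(\mathbf{X}) \sim \mathcal{N}(\mathrm{vec}(\mathbf{\Phi}), \mathbf{K}_{\mathsf{N}} \otimes \mathbf{K}_{\mathsf{M}})$, where $\mathbf{K}_{\mathsf{M}} \in \mathbb{R}^{M \times M}$ is the row covariance matrix and $\mathbf{K}_{\mathsf{N}} \in \mathbb{R}^{N \times N}$ is the column covariance matrix. 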
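
To make the separability assumption concrete, the sketch below builds the joint covariance of $\mathrm{vec}(\mathbf{X})$ as a Kronecker product and checks that each entry is the product of a row covariance and a column covariance. This is an illustration only: the random covariance matrices and the helper `flat` are hypothetical, and the indexing assumes the column-stacking vec convention.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4

# Hypothetical positive semidefinite row and column covariance matrices.
F_m = rng.standard_normal((M, M))
F_n = rng.standard_normal((N, N))
K_M = F_m @ F_m.T
K_N = F_n @ F_n.T

K = np.kron(K_N, K_M)   # joint covariance of vec(X), shape (M*N, M*N)


def flat(m, n):
    """Position of matrix entry (m, n) inside the column-stacked vec(X)."""
    return n * M + m


# Every joint entry factorizes as K_M[m, m'] * K_N[n, n'].
separable = all(
    np.isclose(K[flat(m, n), flat(m2, n2)], K_M[m, m2] * K_N[n, n2])
    for m in range(M) for n in range(N)
    for m2 in range(M) for n2 in range(N)
)
print(separable)
```

Storing only `K_M` and `K_N` needs $M^2 + N^2$ numbers, whereas an unstructured joint covariance needs $(MN)^2$.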

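This definition also extends to finite subsets that are not complete matrices. Given any finite subset $\mathsf{T} = \{t = (m, n) \mid m \in \mathsf{M}, n \in \mathsf{N}\}$, where $T = |\mathsf{T}| \leq M \times N$, the vector $\mathbf{x} = [x_{t_1} \ldots x_{t_T}]$ is distributed as $\mathbf{x} \sim \mathcal{N}(\mathbf{\Phi}_{\mathsf{T}}, \mathbf{K})$. The vector $\mathbf{\Phi}_{\mathsf{T}} = [\phi(1) \ldots \phi(T)] \in \mathbb{R}^{T}$ is arranged from the entries of the mean matrix corresponding to the set $t \in \mathsf{T}$, and $\mathbf{K}$ is the covariance matrix evaluated only on pairs $(t, t') \in \mathsf{T} \times \mathsf{T}$. 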

Our goal is to estimate an unknown response matrix $\mathbf{R} \in \mathbb{R}^{M \times N}$ with $M$ rows and $N$ columns. We assume observed data consisting of a subset of the matrix entries $\mathbf{r} = [r_{t_1} \ldots r_{t_T}]$ collected into a vector. Note that $\mathsf{T} \subset \mathsf{M} \times \mathsf{N}$; hence, the data represents a partially observed matrix. Our generative assumption proceeds as follows (see Fig. 1): 

1) Draw the function $Z$ from a zero mean MV-GP: $Z \sim \mathcal{GP}(0, \mathcal{K}_{\mathsf{N}} \otimes \mathcal{K}_{\mathsf{M}})$. 

2) Given $z_{m,n} = Z(m, n)$, draw each observed response independently: $r_{m,n} \sim \mathcal{N}(z_{m,n}, \sigma^2)$. 

Hence, $\mathbf{Z} \in \mathbb{R}^{M \times N}$ with entries $z_{m,n} = Z(m, n)$ may be interpreted as the latent noise-free matrix. The inference task is to estimate the posterior distribution $p(Z \mid \mathcal{D})$, where $\mathcal{D} = \{\mathbf{r}, \mathsf{T}\}$. The posterior distribution is again a Gaussian process, i.e., $Z \mid \mathcal{D} \sim \mathcal{GP}(\phi, \Sigma)$, with mean and covariance functions: 

$$\phi(m, n) = \mathbf{K}_{\mathsf{T}}(m, n)\,[\mathbf{K} + \sigma^2 \mathbf{I}_T]^{-1}\,\mathbf{r} \qquad (1)$$
$$\Sigma((m, n), (m', n')) = \mathcal{K}((m, n), (m', n')) - \mathbf{K}_{\mathsf{T}}(m, n)\,[\mathbf{K} + \sigma^2 \mathbf{I}_T]^{-1}\,\mathbf{K}_{\mathsf{T}}(m', n')^\top \qquad (2)$$

where $\mathbf{r} \in \mathbb{R}^{T}$ corresponds to the vector of responses for all training data indexes $(m, n) \in \mathsf{T}$. The covariance function $\mathbf{K}_{\mathsf{T}}(m, n)$ corresponds to the sampled covariance matrix between the index $(m, n)$ and all training data indexes $(m', n') \in \mathsf{T}$, $\mathbf{K}$ is the covariance matrix between all pairs $(m, n), (m', n') \in \mathsf{T} \times \mathsf{T}$, and $\mathbf{I}_T$ is the $T \times T$ identity matrix. The closed form follows directly from the definition of a MV-GP as a scalar GP [9] with appropriately vectorized variables. The model complexity scales with the number of observed samples $T$. Storing the kernel matrix requires $O(T^2)$ memory, and the naive inference implementation requires $O(T^3)$ computation. 

Fig. 2. Plate model of the factor Gaussian process. 
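
The posterior mean (1) and covariance (2) can be sketched in a few lines of NumPy. The example below is illustrative only: the squared exponential kernels, the matrix sizes, the noise variance and the randomly chosen observed set are assumptions, not choices made in the paper. It evaluates the posterior on the full $M \times N$ index set from a partially observed matrix.

```python
import numpy as np


def rbf(a, b, scale=1.0):
    """Squared exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / scale) ** 2)


rng = np.random.default_rng(2)
M, N, sigma2 = 4, 5, 0.1
rows = np.arange(M, dtype=float)
cols = np.arange(N, dtype=float)
K_M, K_N = rbf(rows, rows), rbf(cols, cols)
K_full = np.kron(K_N, K_M)                 # prior covariance of vec(Z)

# Observed index set T, stored as positions in the column-stacked vec(Z).
obs = rng.choice(M * N, size=12, replace=False)
r = rng.standard_normal(obs.size)          # stand-in for observed responses

K_TT = K_full[np.ix_(obs, obs)]            # K in (1)-(2): training covariance
K_allT = K_full[:, obs]                    # row i is K_T(m, n) for entry i
G = K_TT + sigma2 * np.eye(obs.size)

post_mean = K_allT @ np.linalg.solve(G, r)                    # equation (1)
post_cov = K_full - K_allT @ np.linalg.solve(G, K_allT.T)     # equation (2)
Phi = post_mean.reshape(M, N, order="F")   # posterior mean as an M x N matrix
print(Phi.shape, post_cov.shape)
```

The linear solve against the $T \times T$ matrix `G` is the step responsible for the $O(T^3)$ cost noted above.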

Although the matrix-variate Gaussian process approach results in closed form inference, it does not model low rank matrix structure. The factor GP is a hierarchical Gaussian process model that attempts to address this deficiency in the MV-GP [23], [24]. With a fixed model rank $F$, the generative model for the factor GP is as follows (see Fig. 2): 

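1) For each $f \in \{1 \ldots F\}$, draw row functions: $u^f \sim \mathcal{GP}(0, \mathcal{K}_{\mathsf{M}})$. Let $u_m \in \mathbb{R}^{F}$ with entries $u_m^f = u^f(m)$. 

2) For each $f \in \{1 \ldots F\}$, draw column functions: $v^f \sim \mathcal{GP}(0, \mathcal{K}_{\mathsf{N}})$. Let $v_n \in \mathbb{R}^{F}$ with $v_n^f = v^f(n)$. 

3) Draw each matrix entry independently: $r_{m,n} \sim \mathcal{N}(u_m^\top v_n, \sigma^2)\ \ \forall (m, n) \in \mathsf{T}$. 

where $u_m$ is the $m$-th row of $U = [u^1 \ldots u^F] \in \mathbb{R}^{M \times F}$, and $v_n$ is the $n$-th row of $V = [v^1 \ldots v^F] \in \mathbb{R}^{N \times F}$. The maximum-a-posteriori (MAP) estimates of $U$ and $V$ can be computed as the solution of the following optimization problem: 

$$U^*, V^* = \arg\min_{U, V}\ \frac{1}{\sigma^2} \sum_{(m,n) \in \mathsf{T}} (r_{m,n} - u_m^\top v_n)^2 + \mathrm{tr}(U^\top \mathbf{K}_{\mathsf{M}}^{-1} U) + \mathrm{tr}(V^\top \mathbf{K}_{\mathsf{N}}^{-1} V) \qquad (3)$$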
where $\mathrm{tr}(X)$ is the trace of the matrix $X$. However, the joint posterior distribution of $\{U, V\}$ and the distribution of $\mathbf{Z} = UV^\top$ are not Gaussian, and the required expectations and posterior distributions are quite challenging to characterize. A Laplace approximation was proposed by [23], and [24] utilized sampling techniques. Statistically, the factor GP may be interpreted as the sum of rank-one factor matrices. Hence, as the rank $F \to \infty$, the law of large numbers can be used to show that the distribution of $Z$ converges to $\mathcal{GP}(0, \mathcal{K}_{\mathsf{N}} \otimes \mathcal{K}_{\mathsf{M}})$ [23]. 



2.1 Spectral norms of compact operators 

The mean function of the MV-GP is an element of the 
Hilbert space defined by the kernels (covariances). We 
provide a brief overview of some relevant background 
required for defining this representation and for defining 
relevant spectral norms of compact operators. We will 
focus on the MV-GP with Kronecker prior covariance. 
Our exposition is closely related to the approach outlined 
in |27|. Further details may be found in |28|. 

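Let $\mathcal{H}_{\mathcal{K}_{\mathsf{M}}}$ denote the Hilbert space of functions induced by the row kernel $\mathcal{K}_{\mathsf{M}}$. Similarly, let $\mathcal{H}_{\mathcal{K}_{\mathsf{N}}}$ denote the Hilbert space of functions induced by the column kernel $\mathcal{K}_{\mathsf{N}}$. Let $x \in \mathcal{H}_{\mathcal{K}_{\mathsf{M}}}$ and $y \in \mathcal{H}_{\mathcal{K}_{\mathsf{N}}}$ define (possibly infinite dimensional) feature vectors. The mean function of the MV-GP is defined by a linear map $W : \mathcal{H}_{\mathcal{K}_{\mathsf{N}}} \to \mathcal{H}_{\mathcal{K}_{\mathsf{M}}}$, i.e., the bilinear form on $\mathcal{H}_{\mathcal{K}} = \mathcal{H}_{\mathcal{K}_{\mathsf{M}}} \otimes \mathcal{H}_{\mathcal{K}_{\mathsf{N}}}$ given by: 

$$\phi(m, n) = \langle x_m, W y_n \rangle_{\mathcal{H}_{\mathcal{K}_{\mathsf{M}}}}$$

Let $\mathcal{B}$ denote the set of compact bilinear operators mapping $\mathcal{H}_{\mathcal{K}_{\mathsf{N}}} \to \mathcal{H}_{\mathcal{K}_{\mathsf{M}}}$. A compact linear operator $W \in \mathcal{B}$ admits a spectral decomposition [27] with singular values given by $\{\xi_i(W)\}$. 

The trace norm is given by the ell-1 norm on the spectrum of $W$: 

$$\|\phi\|_{\mathcal{K},*} = \sum_{i=1}^{D} \xi_i(W) \qquad (4)$$

When the dimensions are finite, (4) is the trace norm of the matrix $W \in \mathbb{R}^{D_{\mathsf{M}} \times D_{\mathsf{N}}}$. This norm has been widely applied to several machine learning tasks including multitask learning [29], [30] and recommender systems [31]. In addition to the trace norm, a common regularizer is the induced Hilbert norm given by the ell-2 norm on the spectrum of $W$: 

$$\|\phi\|_{\mathcal{K}}^2 = \sum_{i=1}^{D} \xi_i^2(W) \qquad (5)$$

(5) is equivalent to the matrix Frobenius norm for finite dimensional $W \in \mathbb{R}^{D_{\mathsf{M}} \times D_{\mathsf{N}}}$ computed as: 

$$\|W\|_F^2 = \mathrm{tr}(W^\top W)$$

Let $L(\phi, \mathbf{r}, \mathsf{T})$ represent the loss function for a finite set of training data points $\mathsf{T} \subseteq \mathsf{M} \times \mathsf{N}$ and $Q(\phi)$ be a spectral regularizer. We define the regularized risk functional: 

$$L(\phi, \mathbf{r}, \mathsf{T}) + \lambda\,Q(\phi)$$

where $\lambda \geq 0$ is the regularization constant. A representer theorem exists, i.e., the function $\phi$ that optimizes the regularized risk can be represented as a finite weighted sum of the kernel functions evaluated on training data [27]. Employing this representer theorem, the optimizing function can be computed as: 

$$\phi(m, n) = \sum_{m' \in \mathsf{M}} \sum_{n' \in \mathsf{N}} \alpha_{m',n'}\,\mathcal{K}_{\mathsf{M}}(m, m')\,\mathcal{K}_{\mathsf{N}}(n, n') = \mathbf{K}_{\mathsf{M}}(m)\,A\,\mathbf{K}_{\mathsf{N}}(n)^\top \qquad (6)$$

where $A \in \mathbb{R}^{M \times N}$ is a parameter matrix, $\mathbf{K}_{\mathsf{M}}(m)$ is the kernel matrix evaluated between $m$ and all $m' \in \mathsf{M}$, i.e., the $m$-th row of $\mathbf{K}_{\mathsf{M}}$, and $\mathbf{K}_{\mathsf{N}}(n)$ is the kernel matrix evaluated between $n$ and all $n' \in \mathsf{N}$. 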

Computing the norms: The Hilbert norm can be computed as: 

$$\|\phi\|_{\mathcal{K}}^2 = \mathrm{vec}(A)^\top (\mathbf{K}_{\mathsf{N}} \otimes \mathbf{K}_{\mathsf{M}})\,\mathrm{vec}(A) = \mathrm{tr}(A^\top \mathbf{K}_{\mathsf{M}} A\,\mathbf{K}_{\mathsf{N}}). \qquad (7)$$

The trace norm can be computed using a basis transformation approach [27] or by using the low rank "variational" approximation [30]. 
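
As a quick numerical check of the Hilbert norm computation, the sketch below compares the vectorized quadratic form with the trace form. The kernel matrices and the parameter matrix are random stand-ins chosen for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 4
F_m = rng.standard_normal((M, M))
F_n = rng.standard_normal((N, N))
K_M = F_m @ F_m.T                      # hypothetical row kernel matrix
K_N = F_n @ F_n.T                      # hypothetical column kernel matrix
A = rng.standard_normal((M, N))        # parameter matrix from the representer theorem

a = A.reshape(-1, order="F")           # vec(A), column stacking
norm_from_vec = a @ np.kron(K_N, K_M) @ a
norm_from_trace = np.trace(A.T @ K_M @ A @ K_N)
print(np.isclose(norm_from_vec, norm_from_trace))
```

The trace form avoids building the $MN \times MN$ Kronecker product, which matters once $M$ and $N$ are large.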

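Basis transformation: With the index set fixed, define bases $\mathbf{G}_{\mathsf{M}} \in \mathbb{R}^{M \times D_{\mathsf{M}}}$ and $\mathbf{G}_{\mathsf{N}} \in \mathbb{R}^{N \times D_{\mathsf{N}}}$ such that $\mathbf{K}_{\mathsf{M}} = \mathbf{G}_{\mathsf{M}} \mathbf{G}_{\mathsf{M}}^\top$ and $\mathbf{K}_{\mathsf{N}} = \mathbf{G}_{\mathsf{N}} \mathbf{G}_{\mathsf{N}}^\top$. One such basis is the square root of the kernel matrix, $\mathbf{G}_{\mathsf{M}} = \mathbf{K}_{\mathsf{M}}^{1/2}$ and $\mathbf{G}_{\mathsf{N}} = \mathbf{K}_{\mathsf{N}}^{1/2}$. When the feature space is finite dimensional, the feature matrices $\mathbf{X}_{\mathsf{M}} \in \mathbb{R}^{M \times D_{\mathsf{M}}}$ and $\mathbf{X}_{\mathsf{N}} \in \mathbb{R}^{N \times D_{\mathsf{N}}}$ are also an appropriate basis. The mean function can be re-parametrized as $\phi(m, n) = \mathbf{G}_{\mathsf{M}}(m)\,B\,\mathbf{G}_{\mathsf{N}}(n)^\top$, where $B \in \mathbb{R}^{D_{\mathsf{M}} \times D_{\mathsf{N}}}$. Now, the trace norm is given directly by the trace norm of the parameter matrix, i.e., $\|\phi\|_{\mathcal{K},*} = \|B\|_*$. 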
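
The basis transformation can be illustrated as follows. The relation $B = \mathbf{G}_{\mathsf{M}}^\top A\,\mathbf{G}_{\mathsf{N}}$ used here is not stated in the text; it is derived from $\mathbf{K}_{\mathsf{M}} A\,\mathbf{K}_{\mathsf{N}} = \mathbf{G}_{\mathsf{M}} (\mathbf{G}_{\mathsf{M}}^\top A\,\mathbf{G}_{\mathsf{N}}) \mathbf{G}_{\mathsf{N}}^\top$ and should be read as an assumption of this sketch, as should the random kernel matrices and the helper name `sqrt_basis`. The example also checks that a Cholesky basis gives the same trace norm as the symmetric square root.

```python
import numpy as np


def sqrt_basis(K):
    """Symmetric square root G of a PSD matrix K, so that G @ G.T == K."""
    w, V = np.linalg.eigh(K)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


rng = np.random.default_rng(4)
M, N = 4, 5
F_m = rng.standard_normal((M, M))
F_n = rng.standard_normal((N, N))
K_M = F_m @ F_m.T + np.eye(M)          # hypothetical positive definite kernels
K_N = F_n @ F_n.T + np.eye(N)
A = rng.standard_normal((M, N))
Phi = K_M @ A @ K_N                    # mean matrix in the representer form (6)

G_M, G_N = sqrt_basis(K_M), sqrt_basis(K_N)
B = G_M.T @ A @ G_N                    # derived re-parametrization (assumption)
trace_norm = np.linalg.norm(B, "nuc")          # sum of singular values of B
hilbert_sq = np.linalg.norm(B, "fro") ** 2     # squared Hilbert norm

# A different valid basis (Cholesky factors) leaves the spectrum unchanged.
C_M, C_N = np.linalg.cholesky(K_M), np.linalg.cholesky(K_N)
trace_norm_chol = np.linalg.norm(C_M.T @ A @ C_N, "nuc")
print(trace_norm, trace_norm_chol)
```

Because any two square bases of the same kernel matrix differ by an orthogonal factor, the singular values of $B$, and hence both norms, do not depend on which basis is chosen.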

Low rank "variational" approximation: The trace norm can also be computed using the low rank approximation. This is sometimes known as the variational approximation of the trace norm [30]. 

$$\|\phi\|_{\mathcal{K},*} = \min_{\phi = \langle u, v \rangle}\ \frac{1}{2} \sum_{f=1}^{F} \left( \|u^f\|_{\mathcal{K}_{\mathsf{M}}}^2 + \|v^f\|_{\mathcal{K}_{\mathsf{N}}}^2 \right) \qquad (8)$$

where $\langle u, v \rangle = \sum_{f=1}^{F} u^f v^f$. This approach is exact when $F$ is larger than the true rank of $\phi$. Note that this is the same regularization that is required for MAP inference with the factor GP model (3). Hence, when $F$ is sufficiently large, the regularizer in the factor GP model is the trace norm. Unfortunately, it is difficult to select an appropriate rank a-priori, and no such claims exist when $F$ is insufficiently large. With finite dimensions, the variational approximation of the trace norm is given by: 

$$\|W\|_* = \min_{W = UV^\top}\ \frac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right)$$

where $U \in \mathbb{R}^{D_{\mathsf{M}} \times F}$ and $V \in \mathbb{R}^{D_{\mathsf{N}} \times F}$. This sum of factor norms has proven effective for the regularization of matrix factorization models and other latent factor problems [32]. 
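
The finite dimensional variational form can be verified with a singular value decomposition. In the sketch below (an illustration with made-up sizes, not code from the paper), the balanced factors built from the SVD attain the trace norm, while another exact factorization of the same matrix gives a value that is at least as large.

```python
import numpy as np

rng = np.random.default_rng(5)
D_M, D_N, true_rank = 6, 5, 3
W = rng.standard_normal((D_M, true_rank)) @ rng.standard_normal((true_rank, D_N))

P, s, Qt = np.linalg.svd(W, full_matrices=False)
F = s.size                               # F = min(D_M, D_N), at least rank(W)
U = P * np.sqrt(s)                       # D_M x F, balanced factor
V = Qt.T * np.sqrt(s)                    # D_N x F, balanced factor
trace_norm = s.sum()
balanced = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)

# Any invertible R gives another factorization W = (U R)(V R^{-T})^T.
R = np.eye(F) + 0.1 * rng.standard_normal((F, F))
U2, V2 = U @ R, V @ np.linalg.inv(R).T
unbalanced = 0.5 * (np.linalg.norm(U2, "fro") ** 2 + np.linalg.norm(V2, "fro") ** 2)
print(trace_norm, balanced, unbalanced)
```

This is the sense in which the factor penalties in (3) act as a trace norm when the rank $F$ is large enough.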



3 Trace norm constrained inference for the MV-GP 

A generative model for low rank matrices has proven to 
be a challenging problem. We are unaware of any (non- 
hierarchical) distributions in the literature that generate 
sample matrices of low rank. Hierarchical models have 
been proposed, but such models introduce issues such 
as non-convexity and non-identifiability of parameter 
estimates. Instead of seeking a generative model for 
low rank matrices, we propose a variational inference 
approach. We constrain the inference of the MV-GP such 
that the expected value of the approximate posterior 
distribution has a constrained trace norm (and is generally of 
low rank). In contrast to standard variational inference, 
this constraint is enforced in order to extract structure. 

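The goal of inference is to estimate the posterior distribution $p(Z \mid \mathcal{D})$. We propose approximate inference using the log likelihood lower bound [33]: 

$$\ln p(y \mid \mathcal{D}) \geq \mathbb{E}_{Z}[\ln p(y, Z)] - \mathbb{E}_{Z}[\ln q(Z)] \qquad (9)$$

where the expectations are taken with respect to the approximating distribution $q(Z)$. Our approach is to restrict the search to the space of Gaussian processes $q(Z) = \mathcal{GP}(\psi, S)$ subject to a trace norm constraint $\|\psi\|_{\mathcal{K},*} \leq C$ as defined in (4). With no loss of generality, we assume a set of rows $\mathsf{M}$ and columns $\mathsf{N}$ of interest so $\mathsf{T} \subseteq \mathsf{M} \times \mathsf{N}$. Let $\mathbf{Z} \in \mathbb{R}^{M \times N}$ be the matrix of hidden variables. 

Given finite indexes, $q$ is a Gaussian distribution $q(\mathbf{z}) = \mathcal{N}(\boldsymbol{\psi}, S)$ where $\mathbf{z} = \mathrm{vec}(\mathbf{Z}) \in \mathbb{R}^{MN}$, $\boldsymbol{\psi} = \mathrm{vec}(\mathbf{\Psi}) \in \mathbb{R}^{MN}$, and $S \in \mathbb{R}^{MN \times MN}$. Let $P \in \mathbb{R}^{T \times MN}$ be a (partial) permutation matrix such that $S_{\mathsf{T}} = P S P^\top$ is the covariance matrix of the subset of observed entries $t \in \mathsf{T}$, and $\mathbf{K}_{\mathsf{T}} = P \mathbf{K} P^\top$ is the prior covariance of the corresponding subset of entries. Evaluating expectations, the lower bound (9) results in the following inference cost function (omitting terms independent of $\psi$ and $S$): 

$$\max_{\psi, S}\ -\frac{1}{2\sigma^2} \sum_{(m,n) \in \mathsf{T}} (r_{m,n} - \psi_{m,n})^2 - \frac{1}{2\sigma^2}\,\mathrm{tr}(S_{\mathsf{T}}) - \frac{1}{2}\,\boldsymbol{\psi}^\top \mathbf{K}^{-1} \boldsymbol{\psi} - \frac{1}{2}\,\mathrm{tr}(\mathbf{K}^{-1} S) + \frac{1}{2} \ln |S|$$
$$\text{s.t.}\ \ \|\psi\|_{\mathcal{K},*} \leq C$$

where $|X|$ is the determinant of matrix $X$. 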

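First, we compute gradients with respect to $S$. After setting the gradients to zero, we compute: 

$$S = \left( \mathbf{K}^{-1} + \frac{1}{\sigma^2} P^\top P \right)^{-1} = \mathbf{K} - \mathbf{K} P^\top \left( \mathbf{K}_{\mathsf{T}} + \sigma^2 \mathbf{I}_T \right)^{-1} P \mathbf{K}. \qquad (10)$$

The second equality is a consequence of the matrix inversion lemma. Interestingly, (10) is identical to the posterior covariance of the unconstrained MV-GP (2). 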
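
The two expressions for $S$ in (10) can be compared numerically. The following sketch is illustrative: the prior covariance is a random positive definite matrix and the observed set is chosen at random, neither of which comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
MN, T, sigma2 = 8, 5, 0.5
F_k = rng.standard_normal((MN, MN))
K = F_k @ F_k.T + np.eye(MN)             # hypothetical prior covariance of vec(Z)

obs = rng.choice(MN, size=T, replace=False)
P = np.eye(MN)[obs]                      # T x MN matrix selecting observed entries
K_T = P @ K @ P.T                        # prior covariance of the observed entries

# First form in (10): inverse of the posterior precision.
S_precision = np.linalg.inv(np.linalg.inv(K) + P.T @ P / sigma2)
# Second form in (10): matrix inversion lemma, only a T x T solve is needed.
S_lemma = K - K @ P.T @ np.linalg.solve(K_T + sigma2 * np.eye(T), P @ K)
print(np.allclose(S_precision, S_lemma, atol=1e-6))
```

The second form is the one that matches (2), since $\mathbf{K} P^\top$ collects the covariances between every entry and the observed entries.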



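Next, we collect the terms involving the mean. This results in the optimization problem: 

$$\psi^* = \arg\min_{\psi}\ \frac{1}{2\sigma^2} \sum_{(m,n) \in \mathsf{T}} (r_{m,n} - \psi_{m,n})^2 + \frac{1}{2}\,\boldsymbol{\psi}^\top \mathbf{K}^{-1} \boldsymbol{\psi} \quad \text{s.t.}\ \ \|\psi\|_{\mathcal{K},*} \leq C \qquad (11)$$

This is a convex regularized least squares problem with a convex constraint set. Hence, (11) is convex, and $\psi^*$ is unique. Using the Kronecker identity, we can re-write the cost function in parameter matrix form. We can also replace the trace constraint with the equivalent regularizer weighted by $\mu$. Multiplying through by $\sigma^2$ leads to the equivalent cost: 

$$\mathbf{\Psi}^* = \arg\min_{\mathbf{\Psi}}\ \frac{1}{2} \sum_{(m,n) \in \mathsf{T}} (r_{m,n} - \psi_{m,n})^2 + \frac{\sigma^2}{2}\,\mathrm{tr}(\mathbf{\Psi}^\top \mathbf{K}_{\mathsf{M}}^{-1} \mathbf{\Psi}\,\mathbf{K}_{\mathsf{N}}^{-1}) + \mu \sigma^2 \|\psi\|_{\mathcal{K},*} \qquad (12)$$

Applying the representation (6), we recover the parametric form of the mean function $\psi \in \mathcal{H}_{\mathcal{K}_{\mathsf{M}}} \otimes \mathcal{H}_{\mathcal{K}_{\mathsf{N}}}$ as $\mathbf{\Psi} = \mathbf{K}_{\mathsf{M}} A\,\mathbf{K}_{\mathsf{N}}$ where $A \in \mathbb{R}^{M \times N}$. We may also solve for $A$ directly: 

$$A^* = \arg\min_{A}\ \frac{1}{2} \sum_{(m,n) \in \mathsf{T}} \big(r_{m,n} - (\mathbf{K}_{\mathsf{M}} A\,\mathbf{K}_{\mathsf{N}})_{m,n}\big)^2 + \frac{\sigma^2}{2}\,\mathrm{tr}(A^\top \mathbf{K}_{\mathsf{M}} A\,\mathbf{K}_{\mathsf{N}}) + \mu \sigma^2 \|\psi(A)\|_{\mathcal{K},*} \qquad (13)$$

where $\psi(A)$ is the mean function corresponding to the parameter $A$ (see (6)). The representation of the mean function in functional form is useful for avoiding repeated optimization when testing a trained model with different evaluation sets. 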

The approximate posterior distribution is itself a finite 
index set representation of an underlying Gaussian pro- 
cess. 

Theorem 1. The posterior distribution $q = \mathcal{N}(\boldsymbol{\psi}, S)$ is the finite index set representation of the Gaussian process $\mathcal{GP}(\psi, S)$ where the mean function is given by (13) and the covariance function $S$ is given by (2). $q = \mathcal{GP}(\psi, S)$ is the unique posterior distribution that maximizes the lower bound of the log likelihood (9) subject to the trace constraint $\|\psi\|_{\mathcal{K},*} \leq C$. 

Sketch of proof: Uniqueness of the solution follows from (9), which is jointly convex in $\{\psi, S\}$. To show that the posterior distribution is a Gaussian process, we simply need to show that for a fixed training set $\mathcal{D}$, the posterior distribution of the superset $(\mathsf{M} \times \mathsf{N}) \cup (m', n')$ has the same mean function and covariance function. These follow directly from the solution of (13) and from (2) (see (10)). In addition to showing uniqueness, Theorem 1 shows how the trained model can be extended to evaluate the posterior distribution of data points not in training. 

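In the case where a basis for $\mathbf{K}_{\mathsf{M}}$ and $\mathbf{K}_{\mathsf{N}}$ can be found, (11) may be solved using the matrix trace norm approach directly (see section 2.1): 

$$B^* = \arg\min_{B}\ \frac{1}{2} \sum_{(m,n) \in \mathsf{T}} \big(r_{m,n} - (\mathbf{G}_{\mathsf{M}} B\,\mathbf{G}_{\mathsf{N}}^\top)_{m,n}\big)^2 + \frac{\sigma^2}{2} \|B\|_F^2 + \mu \sigma^2 \|B\|_* \qquad (14)$$

where $\mathbf{G}_{\mathsf{M}} \in \mathbb{R}^{M \times D_{\mathsf{M}}}$ is the basis for $\mathbf{K}_{\mathsf{M}}$ and $\mathbf{G}_{\mathsf{N}} \in \mathbb{R}^{N \times D_{\mathsf{N}}}$ is the basis for $\mathbf{K}_{\mathsf{N}}$. $B \in \mathbb{R}^{D_{\mathsf{M}} \times D_{\mathsf{N}}}$ is the estimated parameter matrix. The mean function is then given by $\psi_{m,n} = (\mathbf{G}_{\mathsf{M}} B\,\mathbf{G}_{\mathsf{N}}^\top)_{m,n}$. 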

Spectral elastic net regularization: The regularization that results from the constrained inference has an interesting interpretation as the spectral elastic net norm. As discussed in section 2.1, the mean function may be represented as $\psi(m, n) = \langle x_m, W y_n \rangle_{\mathcal{H}_{\mathcal{K}_{\mathsf{M}}}}$ for $x \in \mathcal{H}_{\mathcal{K}_{\mathsf{M}}}$ and $y \in \mathcal{H}_{\mathcal{K}_{\mathsf{N}}}$. The spectral elastic net is given as a weighted sum of the ell-2 norm and the ell-1 norm on the spectrum $\{\xi_i(W)\}$: 

$$Q_{\alpha,\beta}(\psi) = \alpha \sum_{i=1}^{D} \xi_i^2(W) + \beta \sum_{i=1}^{D} \xi_i(W) \qquad (15)$$

where $\alpha$ and $\beta \geq 0$ are weighting constants. The naming is intentionally suggestive of the analogy to the elastic net regularizer [26]. The elastic net regularizer is a weighted sum of the ell-2 norm and the ell-1 norm of the parameter vector in a linear model. The elastic net is a tradeoff between smoothness, encouraged by the ell-2 norm, and sparsity, encouraged by the ell-1 norm. The elastic net is particularly useful when learning with correlated features. The spectral elastic net has similar properties. The Hilbert norm encourages smoothness over the spectrum, while the trace norm encourages spectral sparsity, i.e., low rank. To the best of our knowledge, this combination of norms is novel, both in the matrix estimation literature and in the kernelized MTL literature. When the dimensions are finite, (15) is given by a weighted sum of the trace norm and the Frobenius norm of the parameter matrix. 

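We propose a parametrization of the mean function inference inspired by the elastic net [26]. Let $\lambda = \sigma^2 (1 + \mu)$ and $\alpha = \frac{\mu}{1 + \mu}$ where $\lambda \geq 0$ and $\alpha \in [0, 1]$. The loss function (14) can be parametrized as: 

$$B^* = \arg\min_{B}\ \frac{1}{2} \sum_{(m,n) \in \mathsf{T}} \big(r_{m,n} - (\mathbf{G}_{\mathsf{M}} B\,\mathbf{G}_{\mathsf{N}}^\top)_{m,n}\big)^2 + \frac{\lambda (1 - \alpha)}{2} \|B\|_F^2 + \lambda \alpha \|B\|_*. \qquad (16)$$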



The same parametrization can also be applied to the equivalent representations given in (12) and (13). This spectral elastic net parametrization clarifies the tradeoff between the trace norm and the Hilbert norm. The trace norm is recovered when $\alpha = 1$, and the Hilbert norm is recovered for $\alpha = 0$. The spectral elastic net approach is also useful for speeding up the computation with warm-start, i.e., for a fixed $\alpha$, we may employ warm-start for decreasing values of $\lambda$. Computation of the spectral elastic net norm follows directly from the Hilbert and trace norms. From the variational approximation of the trace norm (8), it is clear that MAP inference for the factor GP (3) is equivalent to inference for the mean of the trace constrained MV-GP (12) in the special case where $\alpha = 1$ (assuming that the non-convex optimization (3) achieves the global minimum). 
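
The paper does not specify a solver in this section, so the following is only one possible way to minimize the spectral elastic net cost (16): a proximal gradient sketch. It relies on the standard fact that the proximal map of the combined trace and Frobenius penalty soft-thresholds the singular values and then rescales them. All names (`prox_spectral_elastic_net`, `objective`, `fit`), the synthetic data and the parameter values are assumptions made for illustration.

```python
import numpy as np


def prox_spectral_elastic_net(B, step, lam, alpha):
    """Proximal map of step * (lam*alpha*||B||_* + lam*(1-alpha)/2 * ||B||_F^2)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s = np.maximum(s - step * lam * alpha, 0.0) / (1.0 + step * lam * (1.0 - alpha))
    return (U * s) @ Vt


def objective(B, G_M, G_N, R, mask, lam, alpha):
    """Cost (16): squared error on observed entries plus the spectral elastic net."""
    resid = mask * (R - G_M @ B @ G_N.T)
    s = np.linalg.svd(B, compute_uv=False)
    penalty = 0.5 * lam * (1 - alpha) * np.sum(s ** 2) + lam * alpha * np.sum(s)
    return 0.5 * np.sum(resid ** 2) + penalty


def fit(G_M, G_N, R, mask, lam, alpha, iters=200, B0=None):
    """Proximal gradient descent; B0 allows a warm start from a previous solution."""
    B = np.zeros((G_M.shape[1], G_N.shape[1])) if B0 is None else B0.copy()
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    step = 1.0 / (np.linalg.norm(G_M, 2) ** 2 * np.linalg.norm(G_N, 2) ** 2)
    history = [objective(B, G_M, G_N, R, mask, lam, alpha)]
    for _ in range(iters):
        grad = G_M.T @ (mask * (G_M @ B @ G_N.T - R)) @ G_N
        B = prox_spectral_elastic_net(B - step * grad, step, lam, alpha)
        history.append(objective(B, G_M, G_N, R, mask, lam, alpha))
    return B, history


rng = np.random.default_rng(7)
M, N, D = 8, 10, 4
G_M, G_N = rng.standard_normal((M, D)), rng.standard_normal((N, D))
B_true = np.outer(rng.standard_normal(D), rng.standard_normal(D))   # rank one
R = G_M @ B_true @ G_N.T + 0.1 * rng.standard_normal((M, N))
mask = (rng.random((M, N)) < 0.6).astype(float)                      # observed set T

B_hat, history = fit(G_M, G_N, R, mask, lam=1.0, alpha=0.5)
print(history[0], history[-1], np.linalg.matrix_rank(B_hat, tol=1e-6))
```

A warm start along a decreasing sequence of $\lambda$ values, as suggested above, would pass the previous solution through the `B0` argument.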

Non-zero mean prior: To simplify the explanation, we have assumed so far that the prior Gaussian process has a zero mean. The non-zero mean case is a straightforward extension [9]. We include a short discussion for completeness. Let $b_{m,n}$ represent the mean parameter of the Gaussian process prior, i.e., $Z \sim \mathcal{GP}(b, \mathcal{K}_{\mathsf{N}} \otimes \mathcal{K}_{\mathsf{M}})$. The posterior covariance estimate remains the same, and the posterior mean computation requires the same optimization, but with the observation $r_{m,n}$ replaced by $\tilde{r}_{m,n} = r_{m,n} - b_{m,n}$. The resulting posterior mean must then be shifted by the bias, i.e., $\mathbb{E}[Z_{m,n} \mid \mathcal{D}] = \psi_{m,n} + b_{m,n}$. If desired, this parameter may be easily estimated. Suppose we choose to model a row-wise bias. Let $\mathsf{T}_m = \{(m', n) \mid (m', n) \in \mathsf{T},\ m' = m\}$; then, solving the straightforward optimization, we find that the row bias estimate is given by: 

$$b_m = \frac{1}{|\mathsf{T}_m|} \sum_{(m,n) \in \mathsf{T}_m} (r_{m,n} - \psi_{m,n})$$
4 Bipartite ranking 

Bipartite ranking is the task of learning an ordering 
for items drawn from two sets, known as the positive 
set and the negative set, such that the items in the 
positive set are ranked ahead of the items in the negative 
set [1], [2], [3], [4]. Many models for bipartite ranking 
attempt to optimize the pair-wise mis-classification cost, 
i.e., the model is penalized for each pair of data points 
where the positive labeled item is ranked lower than 
the negative labeled item. Although this approach has 
proven effective, the required computation is quadratic 
in the number of items. This quadratic computation cost 
limits the applicability of pair-wise bipartite ranking to 
large scale problems. 

More recently, researchers have shown that it may be 
sufficient to optimize a classification loss, such as the 
exponential loss or the logistic loss, directly to solve the 
bipartite ranking problem [3], [4]. This is also known 
as the point-wise approach in the ranking literature. In 
contrast to the point-wise and pair-wise approach, we 
propose a list-wise bipartite ranking model. The list-wise 
approach learns a ranking model for the entire set of 
items and has gained prominence in the learning to rank 
literature [34], [35] as it comes with strong theoretical 
guarantees and has been shown to have superior empir- 
ical performance. 

Our approach is inspired by monotone retargeting 
(MR) 1 35 1, a recent method for adapting regression 
to ranking tasks. Although many ranking models are 
trained to predict the relevance scores, there is no need 
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sorted vector v, i.e., the threshold for transition between 
+1 and —1. Then u v implies that: 



3be 



s.t. 



mm Uj 

l<i<k 



>b> 



k<j<c 



(17) 



Fig. 3. Monotone re-targeting searclnes for an order 
preserving transformation of the target scores tinat may 
be easier for the regressor to fit. 



to fit scores exactly. Any scores that induce the cor- 
rect ordering will suffice. MR jointly optimizes over 
monotonic transformations of the target scores and an 
underlying regression function (see Fig.|3]|. We will show 
that maximum likelihood parameter estimation in the 
proposed model is equivalent to learning the target 
scores in MR. Though we show this equivalence for the 
special case of bipartite ranking with square loss, the 
relation holds for more general Bregman divergences 
and ranking tasks. This extension is beyond the scope 
of this paper. In addition to improving performance, 
MR has favorable statistical and optimization theoretic 
properties, particularly when combined with a Bregman 
divergence such as squared loss. To the best of our 
knowledge, ours is the first generative model for list- 
wise bipartite ranking. 

4.1 Background 



There are several ways to permute a sorted binary 
vector y e while keeping all its values the same. 
These are permutations that separately re-order the +ls 
at the top of y„^ and the —Is at the bottom. Given y = 
sort(y), we represent the set of permutations that do not 
change the value the sorted y as F = {7(-) | 7(?/) = y}- 
It follows that the set F contains all permutations that 
satisfy 7('u) y. In other words, all v that satisfy v y 
can be represented as v — ■y{v) for some 7 G F. 

We propose a representation for compatible vectors 
that reduces to permutations of isotonic vectors. 

Proposition 3. Let $v \in \mathbb{B}^d_\downarrow$. Any $u \in \mathbb{R}^d$ that satisfies $u \succeq v$
can be represented by $u = \gamma(\bar{u})$ where $\gamma \in \Gamma$, the set $\Gamma =
\{\gamma(\cdot) \mid \gamma(v) = v\}$ and $\bar{u} \in \mathbb{R}^d_\downarrow$.

Sketch of proof: First, we note that by the definition of
compatibility for binary vectors (17), any permutation
$\gamma \in \Gamma$ satisfies $\gamma(\bar{u}) \succeq v$. Next, we note that the
sorted order is a member of the permutation set. This
representation is unique when $u$ satisfies strict ordering.

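The set $\mathbb{R}^d_\downarrow$ is a convex cone. To see this, note that the
convex composition $x = \alpha u + (1 - \alpha) v$, $\alpha \in [0, 1]$ of two
isotonic vectors $u \in \mathbb{R}^d_\downarrow$ and $v \in \mathbb{R}^d_\downarrow$ preserves isotonicity.
Further, any scaling $\alpha x$ where $\alpha > 0$ preserves the ordering.
Let $\Delta^d$ be the set of probability distributions, i.e.,
$\forall x \in \Delta^d$, $x_i \ge 0$ and $\sum_{i=1}^{d} x_i = 1$. The set of probability
distributions in sorted order is given by $\Delta^d_\downarrow = \mathbb{R}^d_\downarrow \cap \Delta^d$,
so for each $x \in \Delta^d_\downarrow$, $x \in \Delta^d$ and $x_i \ge x_j \;\forall j > i$.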

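Lemma 4 (Representation of $\Delta^d_\downarrow$ [35]). The set $\Delta^d_\downarrow$ of all
discrete probability distributions of dimension $d$ that are in
descending order is the image $Cx$ where $x \in \Delta^d$ and $C$ is
an upper triangular matrix generated from the vector $v =
\{1, \tfrac{1}{2}, \ldots, \tfrac{1}{d}\}$ such that $C(i, :) = [\{0\}^{i-1} \times v(i{:})]$, that is,
row $i$ consists of $i - 1$ zeros followed by entries $i$ through $d$ of $v$.

Let $\mathbb{B} = \{+1, -1\}$, and let $\mathbb{B}^d_\downarrow$ be the set of binary isotonic
vectors (binary vectors in sorted order), i.e., any $v \in \mathbb{B}^d_\downarrow$
satisfies $v \in \mathbb{B}^d$ and $v_i \ge v_j \;\forall j > i$. Similarly, let $\mathbb{R}^d_\downarrow$ be
the set of real valued isotonic vectors, i.e., any $v \in \mathbb{R}^d_\downarrow$
satisfies $v \in \mathbb{R}^d$ and $v_i \ge v_j \;\forall j > i$, in which case $v$ satisfies
partial order. We state that $v$ satisfies total order or strict
isotonicity when the ordering is a strict inequality, i.e.,
$v_i > v_j \;\forall j > i$. We denote a vector in sorted order as
$\bar{v} = \mathrm{sort}(v)$.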

Compatibility is a useful concept for capturing the 
match between the sorted order of two vectors. 



Definition 2 (Compatibility [34]). $u$ is compatible with the
sorted order of $v$ (denoted as $u \succeq v$) if for every pair of
indexes $v_i > v_j$ implies $u_i \ge u_j$.

Compatibility is an asymmetric relationship, i.e., $u \succeq v
\nRightarrow v \succeq u$. It follows that sorted vectors always satisfy
compatibility, i.e., if $u$ and $v$ are two sorted vectors, then
by Definition 2, $u \succeq v$. Compatibility is straightforward
to check when the target vector is binary. Let $u \in \mathbb{R}^d$ and
$v \in \mathbb{B}^d_\downarrow$, and let $k$ be the number of $+1$s in the

4.2 Generative model

Let $y_{m,n} \in \mathbb{B}$ be the label for item $n$ in the $m$th task
and let $\mathcal{T}_m = \{n \mid (m,n) \in \mathcal{T}\}$ be the set of items in the
$m$th task so $|\mathcal{T}_m| = T_m$. We define the negative set as
$D^- = \{(m,n) \in \mathcal{T} \mid y_{m,n} = -1\}$ and the positive set
as $D^+ = \{(m,n) \in \mathcal{T} \mid y_{m,n} = +1\}$. For the $m$th task,
the negative set is defined as $D_m^- = \{n \mid (m,n) \in D^-\}$
and the positive set as $D_m^+ = \{n \mid (m,n) \in D^+\}$ so that
$\mathcal{T}_m = D_m^+ \cup D_m^-$. The vector of labels for the $m$th
task is given by $y_m \in \mathbb{B}^{T_m}$.

We propose the following generative model for $y_m$:

$$p(y_m \mid r_m) \propto \prod_{l \in D_m^+} \prod_{l' \in D_m^-} \mathbb{1}_{[r_{m,l} \ge r_{m,l'}]} \qquad (18)$$

where $\mathbb{1}_{[b]}$ is the indicator function defined as:

$$\mathbb{1}_{[b]} = \begin{cases} 1 & \text{if } b \text{ evaluates to true}, \\ 0 & \text{otherwise.} \end{cases}$$

following optimization problem:

Fig. 4. Plate model of generative bipartite ranking with
the latent matrix-variate Gaussian process.

For clarity, we have suppressed the dependence of
$p(y_m \mid r_m)$ on the sets $\{D_m^+, D_m^-\}$.

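It is instructive to compare the form of the generative
model (18) to the area under the ROC curve (AUC) given
by the fraction of correctly ordered pairs:

$$\mathrm{AUC}(r_m) = \frac{1}{|D_m^+|\,|D_m^-|} \sum_{l \in D_m^+} \sum_{l' \in D_m^-} \mathbb{1}_{[r_{m,l} \ge r_{m,l'}]} \qquad (19)$$

We note that $p(y_m \mid r_m)$ is nonzero if and only if $r_m$
satisfies the ordering defined by $\{D_m^+, D_m^-\}$. It follows that
any vector $r_m$ s.t. $p(y_m \mid r_m)$ is nonzero also maximizes
the AUC, since every pair in (19) is then correctly ordered.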
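The relationship between the generative model (18) and the AUC (19) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, labels are assumed to be +1 or -1, and ties are treated as correctly ordered, which is an assumption about how the inequality in the source should be read.

```python
def _ordered(pos_score, neg_score):
    # Assumption: a tie counts as correctly ordered (non-strict inequality).
    return pos_score >= neg_score


def _split(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    return pos, neg


def auc(scores, labels):
    """Fraction of (positive, negative) pairs that are correctly ordered."""
    pos, neg = _split(scores, labels)
    correct = sum(1 for p in pos for n in neg if _ordered(p, n))
    return correct / (len(pos) * len(neg))


def likelihood_is_nonzero(scores, labels):
    """True when every positive/negative pair is correctly ordered."""
    pos, neg = _split(scores, labels)
    return all(_ordered(p, n) for p in pos for n in neg)
```

With labels `[1, -1, 1, -1]`, the scores `[0.9, 0.2, 0.8, 0.1]` order every pair correctly, so the likelihood is nonzero and the AUC is 1.0, while `[0.9, 0.85, 0.8, 0.1]` misorders one of the four pairs and gives an AUC of 0.75.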

We can now combine the bipartite ranking model with
the latent regression model. The full generative model
proceeds as follows (see Fig. 4):

1) Draw the latent variable $Z$ from a zero mean MV-GP
   as $Z \sim \mathcal{GP}(0, \mathcal{K}_M \otimes \mathcal{K}_N)$.
2) Given $z_{m,n} = Z(m,n)$, draw each score
   independently as $r_{m,n} \sim \mathcal{N}(z_{m,n}, \sigma^2)$.
3) For each task $m \in M$, draw the observed response
   vector $y_m$ from $p(y_m \mid r_m)$ as given by (18).



4.3 Inference and parameter estimation 

We utilize variational inference to train the underlying
multitask regression model and maximum likelihood
to estimate the parameters of the bipartite ranking
model. This is equivalent to the variational approximation
$q(r, Z) = \mathbb{1}_{[r = r^*]}\, q(Z)$ where $r = \{r_m\}$. The
variational lower bound of the log likelihood (9) is given
by:

$$\ln p(y \mid \mathcal{D}) \ge \ln p(y \mid r) + \mathbb{E}_Z[\ln p(r, Z)] - \mathbb{E}_Z[\ln q(Z)]$$

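where $y = \{y_m\}$. As outlined in section 3, we restrict
our search to the space for $q(Z)$ of Gaussian processes
$q(Z) = \mathcal{GP}(\psi, S)$ subject to a trace norm constraint
$\|\psi\|_{\mathcal{K},*} \le C$. Evaluating expectations (and ignoring
constant terms independent of $\{r, \psi, S\}$) results in the
following optimization problem:

$$\min_{r, \psi, S} \; -\sum_{m \in M} \sum_{l \in D_m^+} \sum_{l' \in D_m^-} \ln\left(\mathbb{1}_{[r_{m,l} \ge r_{m,l'}]}\right) + \frac{1}{2\sigma^2} \sum_{(m,n) \in \mathcal{T}} (r_{m,n} - \psi_{m,n})^2 + \frac{1}{2\sigma^2} \sum_{(m,n) \in \mathcal{T}} S_{(m,n),(m,n)} + \frac{1}{2} \psi^\top K^{-1} \psi + \frac{1}{2} \mathrm{tr}(K^{-1} S) - \frac{1}{2} \ln |S|$$

$$\text{s.t.} \quad \|\psi\|_{\mathcal{K},*} \le C \qquad (20)$$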



Inference and parameter estimation follow an alternating
optimization scheme. We alternately optimize
each of the parameters $\{r, \psi, S\}$ until a local optimum is
reached. Following section 3, it is straightforward to
show that the optimal $S$ is given in closed form (10)
and is independent of $\{r, \psi\}$. Hence, model training
requires alternating between optimizing $r^* \mid \psi$ and
optimizing $\psi^* \mid r$. We will show that (20) is convex, and the
alternating optimization approach achieves the global
optimum. The optimization for $\psi^* \mid r$ follows directly
from the discussion in section 3. Hence, we will focus
our efforts on the optimization of $r^* \mid \psi$.

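Collecting the terms of (20) that are dependent on $r$
results in the following loss function for $r \mid \psi$:

$$\underbrace{-\sum_{m \in M} \sum_{l \in D_m^+} \sum_{l' \in D_m^-} \ln\left(\mathbb{1}_{[r_{m,l} \ge r_{m,l'}]}\right)}_{\text{Order violation penalty}} + \underbrace{\frac{1}{2\sigma^2} \sum_{(m,n) \in \mathcal{T}} (r_{m,n} - \psi_{m,n})^2}_{\text{Square loss}}$$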

The first term in the loss penalizes violations of order. In
fact, the first term evaluates to infinity if any of the binary
order constraints are violated. Hence, to maximize
the log likelihood, the variables $r$ must satisfy the order
constraints $\{r_m \succeq y_m \;\forall m \in M\}$. This interpretation
suggests a constrained optimization approach:

$$\min_{\{r_m \mid r_m \succeq y_m\}} \; \frac{1}{2} \sum_{l \in \mathcal{T}_m} (r_{m,l} - \psi_{m,l})^2 \quad \forall m \in M \qquad (21)$$

Note that this loss decomposes task-wise. Hence, the 
proposed approach results in a list-wise ranking model. 
We also note that the independence between tasks means 
that the optimization is embarrassingly parallel. 

The constrained score vectors $\{r_m \mid r_m \succeq y_m\}$ can
be optimized efficiently using the inner representation
outlined in Proposition 3. One issue that arises is that
the cost function (20) is not invariant to scale. Hence, the
loss can be reduced just by scaling its arguments down.
To avoid this degeneracy, we must constrain the score
vectors away from 0. We achieve this by constraining the
score vectors to the ordered simplex $\Delta^{T_m}_\downarrow$, as it is a convex
set and satisfies the requirement $0 \notin \Delta^{T_m}_\downarrow$. Applying
Lemma 4, the score is given by $r_m = \gamma_m(C x_m)$
for $x_m \in \Delta^{T_m}$.
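To make the representation from Lemma 4 concrete, the sketch below builds the matrix $C$ and maps a point of the simplex to a descending probability vector. It assumes the generating vector is $(1, 1/2, \ldots, 1/d)$, which is how we read the lemma; the helper names `build_C` and `apply_C` are hypothetical.

```python
def build_C(d):
    """Upper triangular matrix with C[i][j] = 1/(j+1) for j >= i (0-based), else 0.

    Assumption: this follows the reading of Lemma 4 in which row i holds
    i zeros followed by the tail of the vector (1, 1/2, ..., 1/d).
    """
    return [[1.0 / (j + 1) if j >= i else 0.0 for j in range(d)] for i in range(d)]


def apply_C(C, x):
    """Matrix-vector product C x using plain Python lists."""
    return [sum(c * xj for c, xj in zip(row, x)) for row in C]


x = [0.1, 0.6, 0.3]            # any point on the probability simplex
r = apply_C(build_C(3), x)     # a descending vector that still sums to one
```

Here `r` is approximately `[0.5, 0.4, 0.1]`: entry $i$ sums $x_j / j$ over $j \ge i$, so the entries can only decrease, and the total mass is preserved because each $x_j$ is split into $j$ equal parts.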



initialize $\psi$, $\{x_m\}$, $\{\gamma_m\}$
repeat
    Update $\psi^* \mid r$ by solving (12).
    for all $m \in M$ do
        Update $\gamma_m \mid \psi_m$ by block sorting (Lemma 5).
        Update $x_m \mid \gamma_m, \psi_m$ by solving (23).
    end for
until converged
return $\psi$, $\{x_m\}$, $\{\gamma_m\}$
Compute $S$ using (10)

Algorithm 1: Variational inference and maximum likelihood
parameter estimation.



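Let $\psi_m \in \mathbb{R}^{T_m}$ be the score vector ordered to satisfy
$\psi_m = [\{\psi(m,l) \;\forall l \in D_m^+\} \mid \{\psi(m,l') \;\forall l' \in D_m^-\}]$. The
ordering of the score vector is not unique. The loss
function can now be written as:

$$\min_{\gamma \in \Gamma} \; \min_{x \in \Delta^{T_m}} \; \frac{1}{2} \|\gamma(Cx) - \psi_m\|_2^2 \quad \forall m \in M \qquad (22)$$

This is exactly equivalent to:

$$\min_{\pi \in \Gamma} \; \min_{x \in \Delta^{T_m}} \; \frac{1}{2} \|Cx - \pi(\psi_m)\|_2^2 \quad \forall m \in M \qquad (23)$$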



The equivalence can be shown simply by setting $\gamma(\cdot) =
\pi^{-1}(\cdot)$. We present both forms as this provides some
flexibility when implementing the algorithm. We optimize
(23) by alternating optimization. We first optimize the
vector $x_m$ and then optimize the permutation order
$\gamma_m$. The overall optimization combining the variational
inference and maximum likelihood parameter estimation
is presented in Algorithm 1. The probability vector $x_m$
can be optimized efficiently using the exponentiated
gradient (EG) algorithm [36] or other simplex-constrained
least squares solvers and can be embarrassingly
parallelized over the tasks $\mathcal{T}_m$. Optimization of $\gamma_m$ requires
optimizing over all permutations of the vector. This
may be naively solved by expensive enumeration or by
solving a combinatorial assignment problem.
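The simplex-constrained least squares step for $x_m$ can be sketched with a basic exponentiated gradient loop. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the step size `eta`, the iteration count, the uniform initialization, and the function names are all illustrative choices.

```python
import math


def matvec(C, x):
    """Matrix-vector product C x using plain Python lists."""
    return [sum(c * xj for c, xj in zip(row, x)) for row in C]


def loss(C, x, target):
    """Squared loss 0.5 * ||C x - target||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(matvec(C, x), target))


def eg_simplex_least_squares(C, target, steps=2000, eta=0.5):
    """Minimize 0.5 * ||C x - target||^2 over the probability simplex.

    Each step multiplies x by exp(-eta * gradient) and renormalizes, which
    keeps x strictly positive and summing to one.
    """
    d = len(target)
    x = [1.0 / d] * d  # start from the uniform distribution
    for _ in range(steps):
        resid = [a - b for a, b in zip(matvec(C, x), target)]
        # Gradient of the squared loss: C^T (C x - target).
        grad = [sum(C[i][j] * resid[i] for i in range(d)) for j in range(d)]
        w = [xj * math.exp(-eta * g) for xj, g in zip(x, grad)]
        total = sum(w)
        x = [wj / total for wj in w]
    return x
```

Because the update is multiplicative, no explicit projection onto the simplex is needed, which is one reason EG is a natural fit for this subproblem.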

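Lemma 5 (Optimality of sorting [35]). If $x_1 \ge x_2$ and
$y_1 \ge y_2$, then

$$\left\| \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right\|_2^2 \le \left\| \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} y_2 \\ y_1 \end{bmatrix} \right\|_2^2 \quad \text{and} \quad \left\| \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right\|_2^2 \le \left\| \begin{bmatrix} x_2 \\ x_1 \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right\|_2^2.$$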



Lemma 5 implies that selecting the pair-wise sorted
order for equivalent items minimizes the loss. Lemma 5
can then be extended to $y \in \mathbb{R}^d$ using induction over $d$.
Hence, selecting $\gamma_m$ as the sorted ordering in each block
of equivalent values $\{D_m^+, D_m^-\}$ minimizes the loss.
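The block sorting step can be illustrated on a single block of equivalent items. The sketch below assigns the largest candidate value to the item with the largest target, the second largest to the second largest, and so on; the function names are hypothetical, and the brute-force comparison in the usage note is only practical for very small blocks.

```python
from itertools import permutations


def sort_block_to_match(values, targets):
    """Permute `values` so that their ranks agree with the ranks of `targets`."""
    # Indices of targets from largest to smallest.
    order = sorted(range(len(targets)), key=lambda i: targets[i], reverse=True)
    out = [0.0] * len(values)
    for v, i in zip(sorted(values, reverse=True), order):
        out[i] = v
    return out


def squared_loss(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


values = [0.5, 0.3, 0.2]
targets = [0.1, 0.9, 0.4]
matched = sort_block_to_match(values, targets)

# Exhaustive check over all orderings of the block.
best = min(squared_loss(p, targets) for p in permutations(values))
```

For this example `matched` is `[0.2, 0.5, 0.3]` and its squared loss equals `best`, in line with the induction argument built on Lemma 5.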

The set of compatible vectors as defined by Proposition 3
is a convex cone, i.e., the convex composition
$z = \alpha u + (1 - \alpha) v$, $\alpha \in [0, 1]$ of vectors in the set
remains in the set, and any scaling $\alpha z$ where $\alpha > 0$
preserves compatibility. To show convexity of parameter
estimation, it remains to show that the combination of
optimizing $x_m$ and optimizing $\gamma_m$ minimizes (21). This
is shown using the following lemma.



Lemma 6 (Convexity of parameter estimation [35]). Let
$r$ be partitioned into two sets such that $r_1 = \{r_k, \forall k \in D^+\}$
and $r_2 = \{r_k, \forall k \in D^-\}$, and let:

$$r_i^* = \operatorname*{argmin}_{\substack{r_i' \in \Pi(r_i) \\ r_1' \ge r_2'}} \|r_i' - z_i\|_2^2$$

where $\Pi(r_i)$ is the set of all permutations of the vector $r_i$
and $r_1' \ge r_2'$ represents element wise inequality, then $r_i^*$ is
isotonic with $z_i \;\forall i = 1, 2$.

We can now consider the global properties of the vari- 
ational inference and maximum likelihood parameter 
estimation. 

Lemma 7 (Joint convexity). The variational inference and
parameter estimation given by (20) is jointly convex in
$\{\psi, S, r\}$. Alternating optimization (Algorithm 1) recovers the
global optimum.

Sketch of proof: Recall that squared loss is jointly
convex in both of its arguments. In addition, the components
$\{\psi, S\}$ and $r$ are in separate convex sets. Hence to
show global convexity, it is sufficient to show that (20)
is convex separately in $\{\psi, S\}$ (by Theorem 1) and $r$ (by
Lemma 6). It follows from the joint convexity of (20) that
the alternating optimization of Algorithm 1 recovers the
global optimum.

The proposed model is trained to estimate bipartite
ranking scores for each task and the underlying multitask
latent regression distribution. Item rankings are
predicted by sorting the expected noise-free scores of the
trained model $\mathbb{E}[z_{m,n} \mid \mathcal{D}] = \psi(m, n)$.

5 Experiments 

This section details the experiments evaluating the
performance of the proposed model on the disease-gene
prioritization task. We evaluated the modeling
performance on association data curated from the OMIM
database [37] by the authors of [17] and data we curated
ourselves. We partitioned each dataset into five-fold
cross validation sets. The model was trained on 4 of the 5
sets and tested on the held out set. The results presented
are the averaged 5-fold cross validation performance.
Great care was taken to train all the models on the
same datasets. Hence the results represent performance
differences due to either the low rank modeling, the
list-wise bipartite ranking model, or both.

Baseline (ProDiGe [17]): We compared our proposed
model to ProDiGe which, to the best of our knowledge,
is the state of the art in the disease-gene prioritization
literature. ProDiGe estimates the prioritization function
using multitask support vector machines trained with
gene kernel and disease kernel information. Parameter
selection for ProDiGe was performed as suggested by
the authors [17].

OMIM dataset: The OMIM dataset [37] is a curated
database of known human disease-gene associations
(4178 associations in the provided dataset). We derived
the gene-gene interaction graph using data from HumanNet
[38]. We selected all genes with one or more
connections in the network and all diseases with one or
more genetic associations. This resulted in a disease-gene
matrix with M = 3,210 diseases, N = 13,614 genes and
T = 3,636 known associations (data sparsity 0.0083%). In
addition, the gene-gene graph contained 433,224 known
gene-gene links. We note the extreme sparsity of this
matrix, and the resulting difficulty of the ranking task.
Such sparse datasets are typical in the disease-gene
domain. The OMIM dataset did not contain a disease
graph; hence, we were unable to test the generalization
of the methods to new diseases.

Curated dataset: We curated a large disease-gene
association dataset. The set of genes was defined using
the NCBI ENTREZ Gene database [39], and the set of
diseases was defined using the "Disease" branch of
the NIH Medical Subject Heading (MeSH) ontology [40].
We extracted co-citations of these genes and diseases in
the PubMed/Medline database [41] to identify positive
instances of disease-gene associations. We derived our
gene-gene interaction graph using data from HumanNet
[38] and our disease-disease similarity graph from the
MeSH ontology. This resulted in a set of 250,190 observed
interactions, 21,243 genes and 4,496 diseases. We
selected all genes with one or more connections in the
gene-gene graph and all diseases with one or more
connections in the disease-disease graph. This resulted
in a dataset with M = 4,495 diseases, N = 13,614 genes
and T = 224,091 known associations (data sparsity
0.36%). The resulting disease network contained 13,922
links, and the gene network contained 433,224 links.

We were unable to run ProDiGe on the full dataset 
due to insufficient memory for storing the kernel matrix. 
Instead, we trained ProDiGe and the MV-GP models 
on a randomly selected 5% subsample of the associa- 
tions. We also provide results for the MV-GP models 
trained on the full dataset. We performed two kinds of 
experiments for the curated dataset. The first experiment 
(known diseases) tests the ranking ability of the model 
for associations selected randomly over the matrix. The 
second experiment (new diseases) tests the generaliza- 
tion ability of the model for new diseases not observed 
in the training set. For the known disease experiments, 
the cross validation associations were randomly selected 
over the matrix. For the new disease experiments, the 
cross validation was performed row-wise, i.e., we se- 
lected training set diseases and test set diseases. 

Model Setup: The proposed model was trained using
the alternating optimization approach (Algorithm 1). The
trace constrained mean function was estimated using
the cost function (16). The model was trained using our
implementation of the algorithm outlined in [42]. Like
other large scale trace constrained matrix optimizers,
[42] maintains a low rank representation. The rank is
estimated automatically by the optimizer. We found that
employing a row bias improved the model performance,
so we learned row biases while training. Note that the
row offsets do not change the ranking and hence are not
required for testing.

We selected the hyperparameter $\lambda = s \cdot \lambda_{\max}$ with 30
values of s logarithmically spaced between 10^'^ and 1.0. 
Let $\mathcal{L}(B)$ be the loss function. Then $\lambda_{\max}$ is the maximum
singular value of $\left.\frac{\partial \mathcal{L}(B)}{\partial B}\right|_{B=0}$; optimization returns
the zero matrix for any $\lambda \ge \lambda_{\max}$ [30]. We used warm
start to speed up the computation for decreasing values
of $s$. We selected $\alpha \in \{1, 0.8, 0.6, 0.4, 0\}$.

We implemented the full rank Gaussian process model
($\alpha = 0$) by keeping the kernels as separate row and
column kernels. This allowed us to scale the model to the 
larger datasets at the expense of more computations. We 
observed that the full rank model required a significant 
amount of computation time. This observation provides 
further motivation for the low rank approach. The full 
rank Gaussian process was trained directly using the 
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. 

Sampling unknown negative items: Recall that the
observed data only consists of known associations.
Following ProDiGe [17], we sampled "negative" observations
randomly over the disease-gene association matrix.
We sampled 10 different negatively labeled item sets. All
models were trained with the positive set combined with
one of the negatively labeled sets. The model scores were
computed by averaging the scores over the 10 trained
models. All algorithms were trained using the same
samples.
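The negative sampling and score averaging procedure can be sketched as follows. The text does not state how many negatives each set contains, so the sketch assumes one negative per known positive; that choice, the fixed seed, and the function names are illustrative assumptions.

```python
import random


def sample_negative_sets(positives, n_rows, n_cols, n_sets=10, seed=0):
    """Sample `n_sets` sets of unobserved (disease, gene) cells as negatives.

    Assumption: each set holds as many negatives as there are known
    positives; the source does not specify the size of the sets.
    """
    positives = set(positives)
    n_neg = len(positives)
    if n_neg > n_rows * n_cols - len(positives):
        raise ValueError("not enough unobserved cells to sample from")
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        negatives = set()
        while len(negatives) < n_neg:
            cell = (rng.randrange(n_rows), rng.randrange(n_cols))
            if cell not in positives:  # never relabel a known association
                negatives.add(cell)
        sets.append(negatives)
    return sets


def average_scores(score_runs):
    """Average per-item scores across models trained on different negative sets."""
    n_runs = len(score_runs)
    return [sum(col) / n_runs for col in zip(*score_runs)]
```

Sharing the sampled sets across all compared models, as described above, only requires reusing the same return value of `sample_negative_sets`.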

Covariance/Kernels: The covariances for the MV-GP
prior and the kernels for ProDiGe were computed from
the gene graph $G_N$ and the disease graph $G_M$. We performed
preliminary experiments with a large class of graph
kernels [43] and selected the exponential kernel. We
briefly outline kernel generation for the gene kernel. Let
$A_N$ be the adjacency matrix for the gene-gene graph.
We computed the normalized Laplacian matrix as $L_N =
I - D^{-1/2} A_N D^{-1/2}$, where $I$ is the identity matrix and $D$
is a diagonal matrix with entries $D_{i,i} = (A_N \mathbf{1})_i$. The
exponential kernel is given by $K_N = \exp(-L_N)$. Following
the suggestion in [17] and preliminary experiments,
we observed an improvement in performance with the
identity matrix added to the exponential kernel. Hence
the final kernel is given by $K_N = \exp(-L_N) + I$. The
disease kernel was generated using the same
approach when a disease-disease graph was available.
All algorithms were trained using the same kernel
matrices.
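The kernel construction above can be sketched for a small dense graph. To stay dependency free, the matrix exponential is approximated with a truncated Taylor series, which is an illustrative simplification and not necessarily how the kernels were computed for the experiments; for large graphs a library routine would be the usual choice. The function name and the `terms` parameter are hypothetical.

```python
def graph_exponential_kernel(adj, terms=30):
    """Return exp(-L) + I for the normalized Laplacian L of a small dense graph.

    `adj` is a symmetric adjacency matrix given as a list of lists, and every
    node must have at least one connection so that its degree is positive.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    if any(d <= 0 for d in deg):
        raise ValueError("every node needs at least one connection")
    inv_sqrt = [d ** -0.5 for d in deg]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    lap = [[identity[i][j] - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
            for j in range(n)] for i in range(n)]
    # Truncated Taylor series: exp(-L) = sum_k (-L)^k / k!.
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms):
        term = [[-sum(term[i][p] * lap[p][j] for p in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Add the identity matrix to obtain the final kernel.
    return [[result[i][j] + identity[i][j] for j in range(n)] for i in range(n)]
```

The eigenvalues of the normalized Laplacian lie in $[0, 2]$, so the series converges quickly and 30 terms are more than enough for this illustration.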

Metrics: Experimental validation of disease-gene
associations in a laboratory can be time consuming and
costly, so only a small set of the top ranked predictions
is of practical interest. Hence, we focus on metrics that
capture the ranking behavior of the model at the top of
the ranked list. In addition, all metrics are computed on
the test set after removing all relevant genes that had
been observed in the training set. All metrics
are computed per disease and then averaged over all the
diseases in the test set. Let $g_l$ denote the label of item
(gene) $l$ as sorted by the predicted scores of the trained
regression model, and let $G_m = |D_m^+| = \sum_l \mathbb{1}_{[g_l = 1]}$ be
the total number of relevant genes for disease $m$ in the
test data after removing relevant genes observed in the
training data. The metrics computed are as follows:

1) Area under the ROC curve (AUC) (19). This
   measures the overall ranking performance of the
   model.

2) The precision at $k \in \{1, 2, \ldots, 100\}$ computes the
   fraction of relevant genes retrieved out of all
   retrieved genes at position $k$:

   $$P@k = \frac{1}{k} \sum_{l=1}^{k} \mathbb{1}_{[g_l = 1]}$$

3) The recall at $k \in \{1, 2, \ldots, 100\}$ computes the
   fraction of relevant genes retrieved out of all relevant
   genes:

   $$R@k = \frac{\sum_{l=1}^{k} \mathbb{1}_{[g_l = 1]}}{\min(G_m, k)}$$

4) Mean average precision at $k = 100$ (MAP@100)
   computed as the mean of the average precision at
   $k = 100$. The average precision is given as:

   $$AP@k = \frac{\sum_{l=1}^{k} \mathbb{1}_{[g_l = 1]} \, P@l}{\min(G_m, k)}$$

The MAP@100 metric was used for model selection over
the cross validation runs. To reduce notation, MAP refers
to MAP@100 in all results. Higher values reflect better
performance for the AUC, P@k, R@k and MAP metrics,
and their maximum value is 1.0.
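The top-of-list metrics can be sketched directly from their definitions. The sketch assumes that `sorted_labels` holds the labels of one disease's genes already sorted by decreasing predicted score, with 1 marking a relevant gene, and that recall and average precision are normalized by $\min(G_m, k)$, which is our reading of the formulas; the function names are hypothetical.

```python
def precision_at_k(sorted_labels, k):
    """Fraction of the top-k retrieved genes that are relevant."""
    return sum(1 for g in sorted_labels[:k] if g == 1) / k


def _num_relevant(sorted_labels):
    total = sum(1 for g in sorted_labels if g == 1)
    if total == 0:
        raise ValueError("at least one relevant gene is required")
    return total


def recall_at_k(sorted_labels, k):
    """Relevant genes in the top k, normalized by min(G, k) (assumed form)."""
    G = _num_relevant(sorted_labels)
    return sum(1 for g in sorted_labels[:k] if g == 1) / min(G, k)


def average_precision_at_k(sorted_labels, k):
    """Sum of P@l over relevant positions l <= k, normalized by min(G, k)."""
    G = _num_relevant(sorted_labels)
    total = sum(precision_at_k(sorted_labels, l + 1)
                for l, g in enumerate(sorted_labels[:k]) if g == 1)
    return total / min(G, k)
```

For the ranked labels `[1, 0, 1, 0, 0]` and `k = 3`, the precision is 2/3, the recall is 1.0, and the average precision is (1 + 2/3) / 2. MAP@100 would then be the mean of `average_precision_at_k(labels, 100)` over the test diseases.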






Fig. 5. Curated data (known diseases, 5% subsample)
experiment results: precision (top) and recall (bottom)
curves @k = 1, 2, ..., 100. Best and Trace curves overlap.



5.1 Discussion 

We present performance results for ProDiGe, the standard
MV-GP (Hilbert, $\alpha = 0$), the trace norm regularized
MV-GP (Trace, $\alpha = 1$), and the best overall MV-GP model
(Best).

Our first experiment was on known disease prediction
with the 5% subset of the curated data. This
task is very challenging as the training data consisted
of an average of less than 3 known associations out of
the possible 13,614 per disease. The difficulty of this
task is reflected in the performance results shown in
Table 1 and Fig. 5. We found that the trace model had the
same performance as the best MV-GP model, suggesting
that the trace norm is an effective regularization in this
case. We found that the trace regularization resulted in a
significant improvement in performance across metrics
compared to ProDiGe and the Hilbert models. We also
experimented with predicting the gene ranking of new
diseases not seen during training and found similar
performance as shown in Table 2 and Fig. 6. As this is
new disease prediction, none of the known genes are
removed from the test diseases. Interestingly, we found
that this seems to improve the model performance as
compared to the in-matrix prediction.

Next, we experimented with prediction on the full
curated dataset predicting known diseases. We were
unable to run this experiment with ProDiGe due to memory
limitations. Hence, only results from the proposed model
are shown. The results are as shown in Table 3 and Fig. 7.
As expected, we observed a significant improvement in
performance by using the entire dataset. Similar results
were observed for the new disease prediction as shown
in Table 4 and Fig. 8. In all models, we observed that
the trace norm constrained approach out-performed the
standard full rank MV-GP model. We especially note the
performance improvement at the top of the list, as these
are the most important to the domain.

Our final experiment was on the OMIM dataset. The
results are as shown in Table 5 and Fig. 9. These results
are especially interesting as we found that the
best overall model outperformed the trace model, and
significantly outperformed all other models in terms of
ranking at the top of the list. This suggests that the
spectral elastic net regularizer may be most useful with
significant data sparsity. ProDiGe out-performed the
Hilbert model in terms of recall, but the Hilbert model had
the best overall ranking performance as measured by
AUC. We are investigating this observation further, but
preliminary investigation suggests that the metrics are
more sensitive to small changes in order when the data
is very sparse. The sparsity also explains the significant
drop in P@k as k grows.

Fig. 6. Curated data (new diseases, 5% subsample)
experiment results: precision (top) and recall (bottom)
curves @k = 1, 2, ..., 100. Best and Trace curves overlap.

TABLE 1

Curated data experiment (known diseases, 5%
subsample) avg. (std.) performance comparison.

         Best            Trace           Hilbert         ProDiGe
AUC      0.793 (0.002)   0.793 (0.002)   0.687 (0.002)   0.716 (0.001)
MAP      0.042 (0.003)   0.042 (0.003)   0.009 (0.001)   0.003 (0.000)
P@100    0.065 (0.001)   0.065 (0.001)   0.028 (0.001)   0.014 (0.000)
R@100    0.194 (0.001)   0.194 (0.001)   0.083 (0.003)   0.039 (0.002)

Fig. 7. Curated data (known diseases, full dataset)
experiment results: precision (top) and recall (bottom) curves
@k = 1, 2, ..., 100. Best and Trace curves overlap.

TABLE 3

Curated data experiment (known diseases, full dataset)
avg. (std.) performance comparison.

         Best            Trace           Hilbert
AUC      0.869 (0.001)   0.869 (0.001)   0.782 (0.001)
MAP      0.054 (0.001)   0.054 (0.001)   0.006 (0.000)
P@100    0.043 (0.000)   0.043 (0.000)   0.012 (0.000)
R@100    0.241 (0.001)   0.241 (0.001)   0.100 (0.001)



TABLE 2

Curated data experiment (new diseases, 5% subsample)
avg. (std.) performance comparison.

         Best            Trace           Hilbert         ProDiGe
AUC      0.822 (0.014)   0.822 (0.014)   0.661 (0.018)   0.716 (0.001)
MAP      0.047 (0.009)   0.047 (0.009)   0.013 (0.004)   0.003 (0.000)
P@100    0.067 (0.014)   0.067 (0.014)   0.029 (0.009)   0.014 (0.000)
R@100    0.200 (0.019)   0.200 (0.019)   0.078 (0.011)   0.039 (0.002)

TABLE 4

Curated data experiment (new diseases, full dataset)
avg. (std.) performance comparison.

         Best            Trace           Hilbert
AUC      0.871 (0.009)   0.871 (0.009)   0.787 (0.015)
MAP      0.080 (0.018)   0.080 (0.018)   0.013 (0.003)
P@100    0.086 (0.021)   0.086 (0.021)   0.040 (0.010)
R@100    0.255 (0.021)   0.255 (0.021)   0.125 (0.013)



6 Conclusion 

This paper proposes a novel hierarchical model for multitask
bipartite ranking that combines a trace constrained
matrix-variate Gaussian process and a bipartite ranking
model. We showed that the trace constraint led to a mean
function with low rank and discussed the spectral elastic
net as the MAP regularizer that arises from this model.

Fig. 8. Curated data (new diseases, full dataset)
experiment results: precision (top) and recall (bottom) curves
@k = 1, 2, ..., 100. Best and Trace curves overlap.

TABLE 5

OMIM data experiment avg. (std.) performance
comparison.

         Best            Trace           Hilbert         ProDiGe
AUC      0.654 (0.028)   0.649 (0.029)   0.686 (0.016)   0.524 (0.018)
MAP      0.041 (0.008)   0.015 (0.002)   0.001 (0.001)   0.001 (0.000)
P@100    0.001 (0.000)   0.001 (0.000)   0.000 (0.000)   0.000 (0.000)
R@100    0.097 (0.014)   0.053 (0.018)   0.009 (0.003)   0.021 (0.005)



We showed that constrained variational inference for the 
Gaussian process combined with maximum likelihood 
parameter estimation for the ranking model was jointly 
convex. We applied the proposed model to the priori- 
tization of disease-genes and found that the proposed 
model significantly improved performance over strong 
baseline models. 

We plan to explore the trace norm constrained MV- 
GP and the spectral elastic net further and analyze their 
theoretical properties. We also plan to explore parameter 
estimation using the resulting constrained posterior dis- 
tribution. In addition, we plan to investigate the applica- 
tions of the constrained MV-GP to other tasks including 
multitask regression and collaborative filtering. 



Fig. 9. OMIM data experiment results: precision (top) and
recall (bottom) curves @k = 1, 2, ..., 100.